Scaling up visual category recognition to large numbers of classes remains challenging. A promising research direction is zero-shot learning, which does not require any training data to recognize new classes, but instead relies on some form of auxiliary information describing them. Ultimately, this may allow machines to use the textbook knowledge that humans employ to learn about new classes, transferring knowledge from classes they already know well. The most successful zero-shot learning approaches currently require a particular type of auxiliary information -- namely attribute annotations performed by humans -- that is not readily available for most classes. Our goal is to circumvent this bottleneck by substituting such annotations with multiple pieces of information extracted from multiple unstructured text sources readily available on the web. To compensate for this weaker form of auxiliary information, we incorporate stronger supervision in the form of semantic part annotations on the classes from which we transfer knowledge. We achieve our goal with a joint embedding framework that maps multiple text parts as well as multiple semantic parts into a common space. Our results consistently and significantly improve on the state of the art in zero-shot recognition and retrieval.
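To make the joint-embedding idea concrete, the following is a minimal sketch, not the authors' implementation: per-part image features and per-class text embeddings are mapped into a common space via a learned matrix, and a class score is the sum of bilinear compatibilities over semantic parts. All dimensions, variable names, and the random features are illustrative assumptions.

```python
import numpy as np

# Illustrative dimensions (assumptions, not from the paper).
rng = np.random.default_rng(0)
d_img, d_txt, n_parts, n_classes = 64, 32, 3, 5

W = rng.standard_normal((d_img, d_txt)) * 0.1    # joint embedding matrix (learned in practice)
parts = rng.standard_normal((n_parts, d_img))    # image features, one row per semantic part
texts = rng.standard_normal((n_classes, d_txt))  # text embeddings, one row per (unseen) class

def compatibility(parts, texts, W):
    """Sum bilinear scores x_p^T W t_c over semantic parts p, per class c."""
    return (parts @ W @ texts.T).sum(axis=0)     # shape: (n_classes,)

scores = compatibility(parts, texts, W)
pred = int(np.argmax(scores))                    # zero-shot prediction: best-matching class
```

In this bilinear form, training would fit `W` so that an image's part features score highest against the text embedding of its true class; at test time, unseen classes need only their text embeddings.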